1 How to Use this Document

The first portion (sections 1 to 6) of this document provides an overview of the background, motivation, and design of Project STAR, as well as the shortcomings of the study design. It is mostly similar to that covered in the initial analysis report.

This final report adds to the initial analysis by extending the investigation to explore the long-term effects of small class sizes on student academic achievement. The subsequent sections (7 to 10) details procedures for addressing the new question of interest, including data preparation, variable selection, and statistical analysis.

To skip to key conclusions from the initial analysis and new results, click here.

Due to the vast nature of the data, there are many questions/caveats that can be explored, but this analysis will mainly focus on the long-term effects of class size on student academic achievement. Areas beyond this scope will be recognized but also noted that they are not within the scope of this analysis.

2 Abstract

This study investigates the short and long-term effects of small class sizes during early education on student academic achievement using data from Tennessee’s Project STAR (Student/Teacher Achievement Ratio). Employing statistical analyses, we examine how participation in small, regular, or regular-aide classes in grades K–3 impacts students’ academic performance in early elementary years, subsequent grades 4–8, high school GPA, and standardized college readiness test scores (ACT/SAT). Initial findings confirm significant immediate benefits of smaller class sizes on early academic achievement. However, longitudinal analyses on 407 students with complete data (detailed in the subsequent sections) reveal inconsistent or negligible long-term effects beyond grade 3, suggesting that early educational advantages related to class size may diminish over time. This research highlights critical insights into the durability of educational interventions and informs future policy decisions on class size reduction.

3 Introduction

Educational researchers and policymakers have long debated the impact of class size on student academic achievement. Project STAR, initiated by Tennessee in the mid-1980s, was a pioneering randomized controlled trial designed explicitly to explore this issue by assigning students to small classes (13–17 students), regular classes (22–25 students), or regular classes with a full-time teacher’s aide. This landmark study provided a unique opportunity to evaluate the immediate and long-term consequences of reduced class sizes from kindergarten through third grade and subsequently tracked student outcomes through high school.

This report builds upon previous analyses of Project STAR data, initially focused on immediate outcomes, by extending the scope to examine long-term academic impacts. Specifically, it evaluates the hypothesis that early exposure to small classes significantly affects students’ academic trajectories and readiness for higher education. The study rigorously addresses limitations such as student mobility, attrition, and potential confounding factors introduced by the operational adjustments within Project STAR. By systematically assessing academic performance across multiple stages of students’ educational careers, this analysis contributes to a nuanced understanding of whether early class-size reductions yield lasting academic benefits, thereby informing educational policy and future research directions.

4 Background & Motivation of Project STAR

Project STAR, short for the Student/Teacher Achievement Ratio Project, emerged in the mid-1980s as a pioneering effort to explore the relationship between class size and student academic outcomes. Motivated by policy concerns regarding the efficacy of smaller classes, the study was funded by the Tennessee General Assembly and implemented as a randomized controlled trial. In this experiment, students and teachers were randomly assigned to one of three classroom environments—small classes (13–17 students), regular classes (22–25 students), and regular classes with a teacher’s aide. This design was intended to isolate the effect of class size on academic performance while controlling for potential confounders, thereby providing strong evidence on the causal impacts of educational settings.

Beyond its initial focus on early childhood education, Project STAR was designed as a longitudinal study that followed students from kindergarten through third grade, and later into high school. This extended follow-up allowed researchers to investigate long-term outcomes, including high school achievement, graduation rates, and preparedness for higher education. The extensive dataset, which encompasses detailed academic records, teacher assessments, and demographic information, has been invaluable in shaping educational policy and research. By systematically analyzing the long-term effects of early educational interventions, Project STAR has contributed significantly to our understanding of how classroom environments influence academic trajectories and overall student success.

4.1 Dataset Details

library(dplyr)
library(ggplot2)
library(ggthemes)
library(knitr)
library(kableExtra)
library(patchwork)
library(car)
library(MASS)
library(broom)
library(AER)
library(foreign)
library(forcats)
library(tidyr)
library(ggalluvial)
library(plotly)
library(patchwork)
library(AER)
library(broom)

STAR_Students <- read.spss('dataverse_files/PROJECT STAR/STAR_Students.sav', to.data.frame=TRUE)
# Comparison_Students <- read.spss('dataverse_files/PROJECT STAR/Comparison_Students.sav', to.data.frame=TRUE)
STAR_K3_Schools <- read.spss('dataverse_files/PROJECT STAR/STAR_K-3_Schools.sav', to.data.frame=TRUE)
STAR_High_Schools <- read.spss('dataverse_files/PROJECT STAR/STAR_High_Schools.sav', to.data.frame=TRUE)

This investigation utilizes the STAR-and-Beyond database from the Harvard Dataverse, which contains detailed information on students, teachers, and schools involved in Project STAR. The dataset includes records from the original STAR study, as well as follow-up data from high school if available.

The primary student-level data file contains information on 11,601 students who participated in the experimental phase for at least one year between 1985 and 1989. Information for each of grades K-3 includes:

  • Demographic variables
  • School and class identifiers
  • School and teacher information
  • Experimental condition (“class type”)
  • Norm-referenced and criterion-referenced achievement test scores
  • Motivation and self-concept scores

As part of the extended follow-up, added to the records of some or all students, include:

  • Achievement test scores for the students when they were in grades 4 – 8, obtained from the Tennessee State Department of Education
  • Teachers’ ratings of student behavior in grades 4 and 8
  • Students’ self-reports of school engagement and peer effects in grade 8
  • Course taking in mathematics, science, and foreign language in high school, obtained from student transcripts
  • SAT/ACT participation and scores, obtained from ACT, Inc. and from Educational Testing Service
  • Graduation/dropout information, obtained from high school transcripts and the Tennessee State Department of Education

Note: This investigation does not necessarily encompass all variables in the dataset, but rather focuses on key areas of interest related to class size and student achievement (discussed in the subsequent sections).

5 Experimental design of Project STAR

Project STAR was initiated following the passage of House Bill (HB) 544 by the Tennessee Legislature in May 1985, aimed at investigating the effects of class size on student achievement and development in primary grades (K–3). The legislation outlined three primary research questions:

  1. What are the effects of reduced class sizes on student achievement (as measured by normed and criterion-referenced tests) and development outcomes such as self-concept and attendance?
  2. Are there cumulative advantages of remaining in smaller classes over multiple years compared to shorter-term exposure?
  3. Does specialized teacher training enhance student outcomes in reduced-size classes or classes assisted by teacher aides, compared to untrained teachers?

To implement this study, the Tennessee State Department of Education established a research consortium involving representatives from the Department, State Board of Education, State Superintendents’ Association, and four Tennessee universities. The study adhered to an experimental design, randomly assigning students entering kindergarten in 1985 or first grade in 1986 to one of three class conditions:

  • Small class (S): 13–17 students
  • Regular class (R): 22–25 students
  • Regular class with a full-time teacher aide (RA): 22–25 students

Randomization was executed by consortium members and supervised locally by university-affiliated graduate students, ensuring unbiased assignment based on gender, race, and socioeconomic status.

5.1 Selection of Schools

All Tennessee schools were invited to participate under conditions set by the state, including the random assignment requirement, maintenance of standard school policies aside from class size adjustments, and commitment for four consecutive years. Of the initially interested 180 schools, 79 were ultimately selected from 42 districts to ensure representation of inner-city, suburban, urban, and rural settings:

  • 17 inner-city schools
  • 16 suburban schools
  • 8 urban schools
  • 38 rural schools

Participation fluctuated slightly due to mergers and withdrawals, primarily attributed to challenges maintaining randomization and administrative burdens. Consequently, the number of participating schools ranged from 79 in kindergarten to 75 by third grade.

5.2 Operational Adjustments

After the initial year, STAR administrators modified the study slightly by randomly redistributing half of the students between regular (R) and regular-aide (RA) classes for subsequent years due to no significant kindergarten performance differences found between these two groups. Small-class assignments remained unchanged. This is a caveat that is addressed in the subsequent analysis.

Teacher training occurred for a subset of second-grade teachers, with no significant difference in student achievement outcomes observed between trained and untrained teachers. Student mobility also influenced class composition, with new entrants randomly assigned while maintaining small-class constraints. This “class size drift” was documented and considered in subsequent analyses.

5.3 Data Collection and Measures

Academic performance was evaluated annually using the Stanford Achievement Tests (SATs) and the Tennessee Basic Skills First (BSF) tests. Student self-concept and motivation were measured using the SCAMIN inventory. Beyond third grade, additional longitudinal data were collected, including academic performance in grades 4–8 (via the Tennessee Comprehensive Assessment Program, TCAP), student participation and identification with school surveys, college entrance examination data (ACT/SAT), high school transcripts, and graduation/dropout information.

These detailed design features and rigorous methodologies positioned Project STAR as a landmark experimental study capable of robustly determining the causal impacts of class size on educational outcomes. However, the project did not come without limitations and challenges.

6 Shortcomings of Project STAR

Despite its robust experimental design, Project STAR has several notable limitations that should be acknowledged when interpreting its findings:

6.1 Attrition and Mobility Effects

Project STAR experienced considerable student mobility, resulting in many students not remaining in their assigned class types throughout the study period. Such mobility led to a phenomenon known as “class size drift,” where the actual sizes of regular classes sometimes became similar to those of small classes, potentially diluting the experimental contrast and complicating causal inference.

6.2 Generalizability Concerns

The purposeful selection of schools, which aimed to cover diverse geographic and socioeconomic areas within Tennessee, might limit the external validity of the findings. Specifically, Project STAR schools were slightly larger and had slightly lower initial achievement scores compared to statewide averages, raising questions about how representative the findings are for other educational contexts.

6.3 Teacher Training and Implementation Fidelity

The project provided only limited teacher training, which did not specifically equip teachers to leverage smaller class sizes effectively. Additionally, training was not uniformly administered, and there was no demonstrated impact of the training itself. Thus, differences in instructional quality or consistency across classes might have influenced outcomes, independent of class size.

6.4 Short-term vs. Long-term Effects

Although the study was longitudinal, it only maintained controlled class-size conditions through grade three, after which students returned to standard-sized classes. The analysis of longer-term effects beyond third grade thus faces challenges in isolating the direct impact of early exposure to small classes from subsequent educational experiences.

6.5 Limited Control of Classroom Dynamics

Aside from controlling for class size and the presence of aides, the study deliberately maintained “normal” school operations. This approach meant that other important classroom variables, such as teaching methods, curriculum variations, and peer dynamics, remained uncontrolled, potentially confounding the observed effects.

7 Key Conclusions from the Initial Analysis Report

In the initial analysis of Project STAR data, we were primarily interested in answering the following two questions:

Primary question: Are there any differences in math scaled scores in 1st grade across class types?

Secondary question: If there are differences, which class type is associated with the highest math scaled scores in 1st grade?

To answer these questions, we adopted the following two-way ANOVA model with the following structure:

\[Y_{ijk} = \mu_{..} + \alpha_{i} + \beta_{j} + \epsilon_{ijk}\] where the index \(i\) represents the class type: small (\(i=1\)), regular (\(i=2\)), regular with aide (\(i=3\)), and the index \(j\) represents the school indicator. The rest of the parameters are as follows:

  • \(Y_{ijk}\): Mean math score of student \(k\) in class \(i\) at school \(j\).
  • \(\mu_{..}\): Overall mean math score
  • \(\alpha_{i}\): Main effect of class type \(i\) with constraint \(\sum_{i=1} \alpha_{i} = 0\)
  • \(\beta_{j}\): Main effect of school \(j\) with constraint \(\sum_{j=1} \beta_{j} = 0\)
  • \(\epsilon_{ijk}\): Random error in the math score of student \(k\) in class \(i\) at school \(j\)

The assumptions of the two-way ANOVA model are as follows:

  • Independence: The residuals \(\epsilon_{ijk}\) are independent of each other
  • Normality: The residuals \(\epsilon_{ijk}\) are normally distributed
  • Homoscedasticity: The variance of the residuals \(\epsilon_{ijk}\) is constant across all levels of the independent variables

We answered the primary question of interest by conducting an F-test to determine if there are significant differences in math scaled scores across class types. The null and alternative hypotheses were as follows:

  • \(H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0\) (No significant differences in math scaled scores across class types)
  • \(H_A: \text{At least one } \alpha_i \neq 0\) (Significant differences in math scaled scores across class types)

Assumptions for the F-test include the normality of residuals and homoscedasticity, which remain the same as the two-way ANOVA model.

The F-test results indicated that the p-value for class type (star1) is less than 0.05, suggesting that there are significant differences in math scaled scores across class types. We rejected the null hypothesis and concluded that at least one class type has a significantly different mean math score compared to the others.

We implemented the Tukey HSD test to find that students in small classes have significantly higher math scores compared to students in regular classes and regular classes with an aide. However, there was no significant difference between regular classes and regular classes with an aide.

8 Caveats of Initial Analysis Report

While the initial analysis report provided some valuable insights into the short-term effects of class size on student math scores in 1st grade, several caveats and limitations should be considered:

  1. Limited Focus on Math Scaled Scores: The analysis primarily focused on math scaled scores as the outcome variable, neglecting other subjects or measures of student achievement. This narrow focus might not capture the full spectrum of educational outcomes influenced by class size. Utilizing scores from other subjects or broader achievement metrics could provide a more comprehensive understanding of the impact of class size on student learning. It would also allow for a more comprehensive comparison into the long-term effects of class size on student achievement.

  2. Short-term Analysis: The initial analysis only considered the math scores of 1st-grade students, providing a snapshot of the immediate effects of class size on academic performance. While this short-term perspective is valuable, it fails to capture the long-term implications of early educational experiences. A more comprehensive analysis that tracks student outcomes over multiple grades and years would offer a more nuanced understanding of how class size influences academic trajectories. This would require a longitudinal approach that follows students beyond the early grades and into high school and beyond.

Concern about Operational Adjustments Post-1st Grade: The initial analysis did not account for the operational adjustments made after the first year of the study, such as the redistribution of students between regular and regular-aide classes. While most students who were designated to small classes continued in that setting, students in regular and regular-aide classes were randomly reassigned. In an update in 1999, it was reported that class size and pupil teacher ratios (PTR) are not the same, and that PTR does not influence student outcomes. While there may be some concern about keeping kindergarten student data despite the switching, based on the researchers claims and results from the initial analysis, it is safe to assume that the switching between regular and regular-aide classes did not have a significant impact on the results. Therefore for this final analysis, we will continue to include K-3 data, not just grades 1-3 that had minimal switching.

data_alluvium <- subset(STAR_Students, select = c(gkclasstype, g1classtype, g2classtype, g3classtype))

class_levels <- c("small", "regular", "regular-aide")
data_alluvium <- data_alluvium %>%
  mutate(across(everything(), ~ factor(as.numeric(.),
                                       levels = c(1, 2, 3),
                                       labels = class_levels))) %>%
  mutate(across(everything(), ~ fct_explicit_na(., na_level = "Unknown (NA)")))

grade_mapping <- c("gkclasstype" = "Kindergarten",
                   "g1classtype" = "Grade 1",
                   "g2classtype" = "Grade 2",
                   "g3classtype" = "Grade 3")

pairs <- list(c("gkclasstype", "g1classtype"),
              c("g1classtype", "g2classtype"),
              c("g2classtype", "g3classtype"))

flow_list <- lapply(pairs, function(x) {
  data_alluvium %>%
    group_by(across(all_of(x))) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(
      source = paste(grade_mapping[x[1]], ": ", .[[x[1]]], sep = ""),
      target = paste(grade_mapping[x[2]], ": ", .[[x[2]]], sep = "")
    ) %>%
    select(source, target, value)
})

transitions <- bind_rows(flow_list)

nodes <- unique(c(transitions$source, transitions$target))
nodes_df <- data.frame(name = nodes, stringsAsFactors = FALSE)
nodes_df$index <- seq_len(nrow(nodes_df)) - 1

transitions <- transitions %>%
  left_join(nodes_df, by = c("source" = "name")) %>%
  rename(source_index = index) %>%
  left_join(nodes_df, by = c("target" = "name")) %>%
  rename(target_index = index)


get_class <- function(x) sub(".*: ", "", x)

nodes_df$class_type <- sapply(nodes_df$name, get_class)


class_colors <- c(
  "small"        = "#F5C310",  # yellow
  "regular"      = "#0072B2",  # blue
  "regular-aide" = "#009E73",  # green
  "Unknown (NA)" = "#D55E00"   # reddish/orange
)

nodes_df$color <- class_colors[nodes_df$class_type]


transitions$link_color <- scales::alpha(nodes_df$color[ match(transitions$source, nodes_df$name) ], 0.4)



p <- plot_ly(
  type = "sankey",
  orientation = "h",

  arrangement = "snap",
  node = list(
    label = nodes_df$name,
    pad = 15,
    thickness = 20,
    line = list(color = "black", width = 0.5),
    color = nodes_df$color
  ),
  link = list(
    source = transitions$source_index,
    target = transitions$target_index,
    value = transitions$value,
    color = transitions$link_color,
    opacity = 0.1
  )
) %>%
  layout(
    title = "Alluvial Diagram of Students' Transfer",
    font = list(size = 10),
    width = 800
  )

p
data_alluvium <- STAR_Students %>%
  select(gkclasstype, g1classtype, g2classtype, g3classtype) %>%
  drop_na() %>%
  mutate(across(everything(), ~ factor(as.numeric(.),
                                       levels = c(1, 2, 3),
                                       labels = c("small", "regular", "regular-aide"))))

grade_mapping <- c("gkclasstype" = "Kindergarten",
                   "g1classtype" = "Grade 1",
                   "g2classtype" = "Grade 2",
                   "g3classtype" = "Grade 3")


pairs <- list(c("gkclasstype", "g1classtype"),
              c("g1classtype", "g2classtype"),
              c("g2classtype", "g3classtype"))

flow_list <- lapply(pairs, function(x) {
  data_alluvium %>%
    group_by(across(all_of(x))) %>%
    summarise(value = n(), .groups = "drop") %>%
    mutate(
      source = paste(grade_mapping[x[1]], ": ", .[[x[1]]], sep = ""),
      target = paste(grade_mapping[x[2]], ": ", .[[x[2]]], sep = "")
    ) %>%
    select(source, target, value)
})


transitions <- bind_rows(flow_list)


nodes <- unique(c(transitions$source, transitions$target))
nodes_df <- data.frame(name = nodes, stringsAsFactors = FALSE)
nodes_df$index <- seq_len(nrow(nodes_df)) - 1


transitions <- transitions %>%
  left_join(nodes_df, by = c("source" = "name")) %>%
  rename(source_index = index) %>%
  left_join(nodes_df, by = c("target" = "name")) %>%
  rename(target_index = index)


get_class <- function(x) sub(".*: ", "", x)
nodes_df$class_type <- sapply(nodes_df$name, get_class)

class_colors <- c(
  "small"        = "#F5C310",  # yellow
  "regular"      = "#0072B2",  # blue
  "regular-aide" = "#009E73"   # green
)


nodes_df$color <- class_colors[nodes_df$class_type]


transitions$link_color <- scales::alpha(nodes_df$color[ match(transitions$source, nodes_df$name) ], 0.4)


p <- plot_ly(
  type = "sankey",
  orientation = "h",
  arrangement = "snap",
  node = list(
    label = nodes_df$name,
    pad = 15,
    thickness = 20,
    line = list(color = "black", width = 0.5),
    color = nodes_df$color
  ),
  link = list(
    source = transitions$source_index,
    target = transitions$target_index,
    value = transitions$value,
    color = transitions$link_color
  )
) %>%
  layout(
    title = "Alluvial Diagram of Students' Transfer (NA Removed)",
    font = list(size = 10),
    width = 800
  )

p

In the figures above, we see a big shift between regular and regular-aide classes from kindergarten to grade 1. However, we do not consider this a significant issue. Investigating edge cases with some students switching between small and regular classes will be interesting, but is beyond the scope of this analysis.

With these caveats in mind, we aim to extend the analysis of Project STAR data to explore the long-term effects of class size on student academic achievement. The natural new question of interest is:

For students who complete both primary and secondary education with the objective of pursuing higher education (college), does the exposure to small class sizes in early education (K-3) have a significant impact on their high school academic performance and college readiness?

9 Answering the New Question of Interest by Addressing the Caveats

One more point to keep in mind is that Tennessee implemented a new student assessment system the year STAR students entered grade 4, the Tennessee Comprehensive Assessment Program (TCAP). The TCAP assessment battery included norm-referenced tests from the Comprehensive Tests of Basic Skills (CTBS/McGraw Hill, 1989) and BSF criterion-referenced tests for each grade in reading and mathematics. Scores on these tests were made available by the Tennessee State Department of Education, as students progressed from grade 4 (1989-1990) through grade 8 (1993-1994).

The user guide notes that “Scores on the CTBS are not directly comparable to those on the SATs. However, IRT scale scores were available for each CTBS subtest so that comparisons can be made meaningfully across grades 4—8.” Hence the scaled scores are valid for comparison across grades 4-8.

9.1 Subsetting the Data as New Population of Interest to Answer the Extended Question

The new question of interest requires a subset of students who have completed both primary and secondary education, and that we have complete data on their academic performance and college readiness. To achieve this, we will filter the dataset to include only students who have complete data using the flags in the data file the binary flag variables indicate participation/non-participation at each stage of data collection.

Flag Type Variable Name Description
Achievement-data Flags flaggk Achievement data available kindergarten
flagg1 Achievement data available grade 1
flagg2 Achievement data available grade 2
flagg3 Achievement data available grade 3
flagg4 Achievement data available grade 4
flagg5 Achievement data available grade 5
flagg6 Achievement data available grade 6
flagg7 Achievement data available grade 7
flagg8 Achievement data available grade 8
High School Data Flags flagsatact Valid SAT/ACT score available
flaghscourse At least two years of high school course data
flaghsgraduate Data on high school graduation status available
Additional STAR Indicators cmpstype Class type composite 10988 valid, 613 missing
cmpsdura Duration composite 10988 valid, 613 missing

Our subset will require Achievement-data Flags and High School Data Flags to be “YES” for each student. This ensures that we have complete data on student achievement from kindergarten through grade 8 and high school graduation status.

We also require that the composite class type and duration variables (cmpstype and cmpsdura) are not missing, as these are essential for our analysis. Missing values in these two variables is another caveat that can be addressed, but it is beyond the scope of this particular analysis.

We should note that the question of interest is specific to students who have completed both primary and secondary education. Therefore any concerns for selection bias is beyond the scope of this analysis.

STAR_flag_vars <- c('FLAGSGK',
                    'FLAGSG1',
                    'FLAGSG2',
                    'FLAGSG3')

Achievement_flag_vars <- c('flaggk',
                           'flagg1',
                           'flagg2',
                           'flagg3',
                           'flagg4',
                           'flagg5',
                           'flagg6',
                           'flagg7',
                           'flagg8')

HS_College_flag_var <- c('flagsatact',
                         'flaghscourse',
                         'flaghsgraduate')

# Subset students who have Achievement_flag_vars and HS_College_flag_var columns == 'YES'
All_Grades_Students <- STAR_Students %>%
  filter(if_all(all_of(Achievement_flag_vars), ~ . == "YES") & 
         if_all(all_of(HS_College_flag_var), ~ . == "YES") &
          # if `cmpstype` is not missing
         !is.na(cmpstype) & !is.na(cmpsdura)
         
         )

9.1.1 Data Completeness and Proportion of Students with Complete Data for Longitudinal Analysis

Achievement_data_students <- STAR_Students %>%
  filter(if_all(all_of(Achievement_flag_vars), ~ . == "YES"))

HS_College_data_students <- STAR_Students %>%
  filter(if_all(all_of(HS_College_flag_var), ~ . == "YES"))

student_count <- nrow(STAR_Students)


complete_student_count1 <- nrow(Achievement_data_students)
incomplete_student_count1 <- student_count - complete_student_count1
counts1 <- c(complete_student_count1, incomplete_student_count1)
percent1 <- round(100 * counts1 / student_count, 1)
labels1 <- paste(c("Data Available", "Data \n Unavailable"), 
                 "\nCount:", counts1, 
                 "\n", percent1, "%")


complete_student_count2 <- nrow(HS_College_data_students)
incomplete_student_count2 <- student_count - complete_student_count2
counts2 <- c(complete_student_count2, incomplete_student_count2)
percent2 <- round(100 * counts2 / student_count, 1)
labels2 <- paste(c("Data Available", "Data \n Unavailable"), 
                 "\nCount:", counts2, 
                 "\n", percent2, "%")


complete_student_count3 <- nrow(All_Grades_Students)
incomplete_student_count3 <- student_count - complete_student_count3
counts3 <- c(complete_student_count3, incomplete_student_count3)
percent3 <- round(100 * counts3 / student_count, 1)
labels3 <- paste(c("Data Available", "Data \n Unavailable"), 
                 "\nCount:", counts3, 
                 "\n", percent3, "%")


par(mfrow = c(2, 2), mar = c(1, 4, 1, 4))

pie(counts1, labels = labels1, 
    main = "Achievement Data Completeness", col = c("lightblue", "lightgray"))
pie(counts2, labels = labels2, 
    main = "High School Data Completeness", col = c("lightgreen", "lightgray"))
pie(counts3, labels = labels3, 
    main = "Achievement & High School Completeness", col = c("lightcoral", "lightgray"))


par(mfrow = c(1, 1))

We notice that the proportion of students with complete data for both achievement and high school/college readiness is relatively low compared to the total number of students in the dataset. This highlights the challenges associated with longitudinal studies and the importance of data completeness for robust analyses. Despite the limitations, we are still left with 506 students who have complete data for the longitudinal analysis.

9.1.2 Keeping Students who have been in a Specific Class Type for at Least 3 Years

Our analysis focuses on the long-term effects of class size on student academic achievement. To ensure that we capture the cumulative impact of class size exposure, we will filter the dataset to include only students who have been in a specific class type for at least three years. Three years or more is chosen to account for the fact that kindergarten attendance was not mandatory by law or by study design. We utilize the cmpsdura variable, which represents the duration of exposure to a specific class type.

This decision addresses several concerns:

  1. Class Switching: We noticed a small amount of class switching between all three class types (some types more than others), and ensuring that students have been in a specific class type for at least three years minimizes the impact of these switches on the analysis. In other words, we are interested in students who have not switched classes frequently.
  2. Students who joined/dropped out of STAR: Some students may have joined or dropped out of the STAR program during the study period. By focusing on students who have been in a specific class type for at least three years, we essentially eliminate students who have not had a consistent class type exposure.
  3. Cumulative Impact: The three-year threshold allows us to capture the cumulative impact of early education class size on student academic outcomes.
cmpsdura_table_original <- All_Grades_Students %>%
  filter(!is.na(cmpsdura)) %>%
  group_by(cmpsdura) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(percentage = sprintf("%.2f%%", count / sum(count) * 100))

head(cmpsdura_table_original) %>%
  kable(caption = "Distribution of Duration Composite (`cmpsdura`) in K-12 Data Available Students") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Distribution of Duration Composite (cmpsdura) in K-12 Data Available Students
cmpsdura count percentage
1 76 15.02%
2 13 2.57%
3 98 19.37%
4 319 63.04%
All_Grades_Students <- All_Grades_Students %>%
  filter(cmpsdura > 2)

We see that a majority (407 students) out of the 506 students with complete data have been in a specific class type for at least three years.

Any results or conclusions beyond this point are based on these 407 students who have complete data for the analysis.

As with the initial analysis, any conclusions made from this report can only be generalized to the population of students who have complete data and have been in a specific class type for at least three years and have completed their education in Tennessee.

9.1.3 Variables of Interest for Longitudinal Analysis

We keep the following variables for the longitudinal analysis:

Type Variable Name Description
Demographic Variables stdntid Student ID
Class Type & STAR Participation gkclasstype Class type in kindergarten
g1classtype Class type in grade 1
g2classtype Class type in grade 2
g3classtype Class type in grade 3
cmpstype Class type composite
cmpsdura Duration composite
yearsstar Number of years in STAR
Reading & Math Scores gktreadss Total reading scaled score SAT kindergarten
gktmathss Total math scaled score SAT kindergarten
g1treadss Total reading scale scores SAT Grade 1
g1tmathss Total math scale scores SAT Grade 1
g2treadss Total math scale scores SAT Grade 2
g2tmathss Total reading scale scores SAT Grade 2
g3treadss Total math scale scores SAT Grade 3
g3tmathss Total reading scale scores SAT Grade 3
g4treadss Total math scale scores CTBS Grade 4
g4tmathss Total reading scale scores CTBS Grade 4
g5treadss Total math scale scores CTBS Grade 5
g5tmathss Total reading scale scores CTBS Grade 5
g6treadss Total math scale scores CTBS Grade 6
g6tmathss Total reading scale scores CTBS Grade 6
g7treadss Total math scale scores CTBS Grade 7
g7tmathss Total reading scale scores CTBS Grade 7
g8treadss Total math scale scores CTBS Grade 8
g8tmathss Total reading scale scores CTBS Grade 8
High School Performance hsgpaoverall GPA overall high school
hsactcomp ACT composite score high school
hsactconverted SAT –> ACT (test score reported in ACT composite metric) high school
SAT_students <- STAR_Students %>%
  filter(hssat == "YES" & hsact == "NO")

ACT_students <- STAR_Students %>%
  filter(hsact == "YES" & hssat == "NO")

SAT_ACT_students <- STAR_Students %>%
  filter(hssat == "YES" & hsact == "YES")

no_SAT_ACT_students <- STAR_Students %>%
  filter(hssat == "NO" & hsact == "NO")

num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
num_no_SAT_ACT_students <- nrow(no_SAT_ACT_students)


df <- data.frame(
  Category = c("SAT Only", "ACT Only", "Both SAT & ACT", "Neither SAT nor ACT"),
  Count = c(num_SAT_students, num_ACT_students, num_SAT_ACT_students, num_no_SAT_ACT_students)
)


df$Percentage <- round(df$Count / sum(df$Count) * 100, 1)


pie_chart4 <- plot_ly(
  data = df,
  labels = ~Category,
  values = ~Percentage, 
  type = 'pie',
  textinfo = 'percent',
  hoverinfo = 'label+percent+text',
  text = ~paste0("Count: ", Count),
  marker = list(colors = c("lightblue", "lightgreen", "lightcoral", "lightgray")),
  title = "SAT and ACT Participation (Original Data)",
  width = 500, height = 400
)

SAT_students <- All_Grades_Students %>%
  filter(hssat == "YES" & hsact == "NO")

ACT_students <- All_Grades_Students %>%
  filter(hsact == "YES" & hssat == "NO")

SAT_ACT_students <- All_Grades_Students %>%
  filter(hssat == "YES" & hsact == "YES")

no_SAT_ACT_students <- All_Grades_Students %>%
  filter(hssat == "NO" & hsact == "NO")

num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
num_no_SAT_ACT_students <- nrow(no_SAT_ACT_students)


df <- data.frame(
  Category = c("SAT Only", "ACT Only", "Both SAT & ACT", "Neither SAT nor ACT"),
  Count = c(num_SAT_students, num_ACT_students, num_SAT_ACT_students, num_no_SAT_ACT_students)
)


df$Percentage <- round(df$Count / sum(df$Count) * 100, 1)


pie_chart5 <- plot_ly(
  data = df,
  labels = ~Category,
  values = ~Percentage, 
  type = 'pie',
  textinfo = 'percent',
  hoverinfo = 'label+percent+text',
  text = ~paste0("Count: ", Count),
  marker = list(colors = c("lightblue", "lightgreen", "lightcoral", "lightgray")),
  title = "SAT and ACT Participation (subsetted students)",
  width = 500, height = 400
)

# Subset data for STAR_Students
SAT_students <- STAR_Students[STAR_Students$hssat == "YES" & STAR_Students$hsact == "NO", ]
ACT_students <- STAR_Students[STAR_Students$hsact == "YES" & STAR_Students$hssat == "NO", ]
SAT_ACT_students <- STAR_Students[STAR_Students$hssat == "YES" & STAR_Students$hsact == "YES", ]
no_SAT_ACT_students <- STAR_Students[STAR_Students$hssat == "NO" & STAR_Students$hsact == "NO", ]

num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
num_no_SAT_ACT_students <- nrow(no_SAT_ACT_students)

# Data for first pie chart
counts1 <- c(num_ACT_students, num_SAT_students, num_SAT_ACT_students, num_no_SAT_ACT_students)
labels1 <- c("ACT Only", "SAT Only", "Both SAT & ACT", "Neither SAT nor ACT")
percentages1 <- round(counts1 / sum(counts1) * 100, 1)
labels1 <- paste(labels1, "\n", percentages1, "%")

# Subset data for All_Grades_Students
SAT_students <- All_Grades_Students[All_Grades_Students$hssat == "YES" & All_Grades_Students$hsact == "NO", ]
ACT_students <- All_Grades_Students[All_Grades_Students$hsact == "YES" & All_Grades_Students$hssat == "NO", ]
SAT_ACT_students <- All_Grades_Students[All_Grades_Students$hssat == "YES" & All_Grades_Students$hsact == "YES", ]

num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)

# Data for second pie chart
counts2 <- c(num_ACT_students, num_SAT_students, num_SAT_ACT_students)
labels2 <- c("ACT Only", "SAT Only", "Both SAT & ACT")
percentages2 <- round(counts2 / sum(counts2) * 100, 1)
labels2 <- paste(labels2, "\n", percentages2, "%")

# Set graphical parameters for side-by-side plots
par(mfrow = c(1, 2), mar = c(1, 4, 10, 4))

# Pie chart for STAR_Students
pie(counts1, labels = labels1, col = c("lightgreen", "lightblue", "lightcoral", "lightgray"),
    main = "SAT & ACT Participation \n (Original Data)")

# Pie chart for All_Grades_Students
pie(counts2, labels = labels2, col = c("lightgreen", "lightblue", "lightcoral"),
    main = "SAT & ACT Participation \n (Subsetted Students - \n New Population)")

# Reset graphical parameters
par(mfrow = c(1, 1))

To help us utilize standardized college entrance exam scores, we need both ACT and SAT scores to be comparable (i.e. on the same scale). The ACT scores can range between 1 to 36, while the SAT scores can range between 400 to 1600 during the time of the study. While the SAT score ranges changed over time, at the time of this report, it has reverted back to the same 400-1600 scale.

Converting the scores to match either exam scale will inevitably lead to some loss of information, but it is necessary for meaningful comparisons. To minimize this loss of information, we will keep the popular ACT composite score, hsactcomp (taken by 90% of new sample of interest), as is and utilize the converted SAT composite to ACT score (hsactconverted) for the analysis.

variables_of_interest <- c(
  'stdntid',
  'gkclasstype',
  'g1classtype',
  'g2classtype',
  'g3classtype',
  'cmpstype',
  'cmpsdura',
  'yearsstar',
  'gktreadss',
  'gktmathss',
  'g1treadss',
  'g1tmathss',
  'g2treadss',
  'g2tmathss',
  'g3treadss',
  'g3tmathss',
  'g4treadss',
  'g4tmathss',
  'g5treadss',
  'g5tmathss',
  'g6treadss',
  'g6tmathss',
  'g7treadss',
  'g7tmathss',
  'g8treadss',
  'g8tmathss',
  'hsgpaoverall',
  'hsactcomp',
  'hsactconverted'
)

# Keep variables_of_interest from All_Grades_Students

subsetted_data <- All_Grades_Students %>%
  select(all_of(variables_of_interest))

# Table of percentages and counts for class type (`cmpstype`)

class_type_table_original <- STAR_Students %>%
  filter(!is.na(cmpstype)) %>%
  group_by(cmpstype) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(percentage = sprintf("%.2f%%", count / sum(count) * 100))

head(class_type_table_original) %>%
  kable(caption = "Distribution of Class Types (`cmpstype`) in Original Data") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Distribution of Class Types (cmpstype) in Original Data
cmpstype count percentage
SMALL 3202 29.14%
REGULAR 3045 27.71%
AIDE 4741 43.15%
class_type_table <- subsetted_data %>%
  group_by(cmpstype) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(percentage = sprintf("%.2f%%", count / sum(count) * 100))

# Print using kable
head(class_type_table) %>%
  kable(caption = "Distribution of Class Types (`cmpstype`) in Subsetted Data") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Distribution of Class Types (cmpstype) in Subsetted Data
cmpstype count percentage
SMALL 167 40.05%
REGULAR 66 15.83%
AIDE 184 44.12%

9.2 Supporting Analysis 1: Verifying Short Term Results with Subsetted Data

In this section, we will verify the initial analysis results with the new sample/population of interest. The initial analysis used the scaled math scores of 1st-grade students to investigate the impact of class size on academic performance. The two-way ANOVA model showed that the school (schoolid) did not have a significant effect on math scores, while the class type (cmpstype) did have a significant effect. We will replicate this analysis with the new subset of students.

# Histogram of math scaled scores for grade 1 students (original data)

math_hist_original <- ggplot(STAR_Students, aes(x = g1tmathss)) +
                          geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
                          labs(title = "Distribution of Math Scaled Scores (Grade 1)",
                               x = "Math Scaled Score",
                               y = "Frequency") +
                          theme_minimal()

# Histogram of math scaled scores for grade 1 students (subsetted data)

math_hist_subset <- ggplot(subsetted_data, aes(x = g1tmathss)) +
                        geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
                        labs(title = "Distribution of Math Scaled Scores (Grade 1)",
                             x = "Math Scaled Score",
                             y = "Frequency") +
                        theme_minimal()

# Side by side boxplots of math scaled scores by class type (`g1classtype`) for original and subsetted data

math_boxplot <- ggplot(STAR_Students, aes(x = g1classtype, y = g1tmathss)) +
                    geom_boxplot(fill = "skyblue", color = "black") +
                    geom_jitter(width = 0.2, color = "navy", alpha = 0.05) +
                    labs(title = "Math Scores by Class Type (Original)",
                         x = "Class Type",
                         y = "Math Scaled Score") +
                    theme_minimal()

math_boxplot_subset <- ggplot(subsetted_data, aes(x = g1classtype, y = g1tmathss)) +
                          geom_boxplot(fill = "skyblue", color = "black") +
                          geom_jitter(width = 0.2, color = "navy", alpha = 0.5) +
                          labs(title = "Math Scores by Class Type (New)",
                               x = "Class Type",
                               y = "Math Scaled Score") +
                          theme_minimal()



math_hist_original + math_hist_subset

math_boxplot + math_boxplot_subset

9.2.1 One-Way ANOVA for Math Scaled Scores by Class Type

We will conduct a one-way ANOVA to test the null hypothesis that the mean math scaled scores are the same across different class types in grade 1 in this new population. The alternative hypothesis is that at least one class type has a different mean math scaled score.

The model is as follows:

\[\begin{align*} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \end{align*}\]

where:

  • \(Y_{ij}\): Mean math scaled score for the \(j\)-th student in the \(i\)-th class type
  • \(\mu\): Overall mean math scaled score
  • \(\alpha_i\): Main effect of the \(i\)-th class type
  • \(\epsilon_{ij}\): Random error in the math scaled score for the student \(j\) in the \(i\)-th class type

The independence, normality, and homoscedasticity assumptions are required for the validity of the ANOVA results.

# Main effect plot of class type

class_type_effect_plot_info <- subsetted_data %>%
  group_by(g1classtype) %>%
  summarise(mean_math_score = mean(g1tmathss, na.rm = TRUE),
            SE = sd(g1tmathss, na.rm = TRUE) / sqrt(n()),
            Lower_CI = mean_math_score - qt(0.975, n() - 1) * SE,
            Upper_CI = mean_math_score + qt(0.975, n() - 1) * SE)

class_type_effect_plot <- ggplot(class_type_effect_plot_info, aes(x = g1classtype, y = mean_math_score)) +
                          geom_point(size = 3, color = "skyblue") +
                          geom_line(aes(group = 1), color = "skyblue") +
                          geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
                          labs(title = "Main Effect of Class Type on Math Scores (Grade 1)",
                               x = "Class Type",
                               y = "Mean Math Scaled Score") +
                          theme_minimal()

class_type_effect_plot

# One-way ANOVA model
aov1 <- aov(g1tmathss ~ g1classtype, data = subsetted_data)

# Summary of ANOVA
aov1_summary <- broom::tidy(aov1)

kable(aov1_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 1) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Math Scaled Scores by Class Type (Grade 1) in New Population of Interest
term df sumsq meansq statistic p.value
g1classtype 2 24343.58 12171.791 8.963494 0.0001546
Residuals 413 560824.79 1357.929 NA NA

Like the initial analysis, the one-way ANOVA results show that the class type has a significant effect on math scaled scores in grade 1 for the new population of interest.

In the initial analysis, the Tukey’s HSD post-hoc test identified that students in small classes have significantly higher math scores compared to students in either regular or regular-aide classes. We will conduct the same post-hoc test to identify which class types have significantly different math scores in this new population.

aov1_tukey_results <- TukeyHSD(aov1)

kable(tidy(aov1_tukey_results), 
      caption = "Tukey's HSD Post-hoc Test for Math Scaled Scores by Class Type (Grade 1) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Tukey’s HSD Post-hoc Test for Math Scaled Scores by Class Type (Grade 1) in New Population of Interest
term contrast null.value estimate conf.low conf.high adj.p.value
g1classtype REGULAR CLASS-SMALL CLASS 0 -19.553846 -32.215146 -6.892546 0.0009198
g1classtype REGULAR + AIDE CLASS-SMALL CLASS 0 -13.371585 -22.633163 -4.110006 0.0021610
g1classtype REGULAR + AIDE CLASS-REGULAR CLASS 0 6.182262 -6.333442 18.697965 0.4767002
aov1_tukey_df <- as.data.frame(aov1_tukey_results$g1classtype)
aov1_tukey_df$comparison <- rownames(aov1_tukey_df)

colnames(aov1_tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")

ggplot(aov1_tukey_df, aes(x = Comparison, y = Difference)) +
  geom_point(aes(color = p_value < 0.05), size = 3) +
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Tukey's HSD Post-hoc Test for Math Scaled Scores by Class Type (Grade 1)",
       x = "Class Type Comparison",
       y = "Difference in Mean Math Scaled Scores") +
  coord_flip() +
  theme_minimal()

We observe that the Tukey’s HSD post-hoc test results are consistent with the initial analysis. Students in small classes have significantly higher math scores compared to students in regular and regular-aide classes in this new population of interest.

9.2.2 One way ANOVA for Reading Scaled Scores by Class Type

Although not covered in the initial report, we want to also verify the short term effect of class size on reading scaled scores in grade 1 for the new population of interest.

# One-way ANOVA model for reading scaled scores by class type (grade 1)
aov1_reading <- aov(g1treadss ~ g1classtype, data = subsetted_data)

# Summary of ANOVA
aov1_reading_summary <- broom::tidy(aov1_reading)

kable(aov1_reading_summary, caption = "ANOVA Summary for Reading Scaled Scores by Class Type (Grade 1) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Reading Scaled Scores by Class Type (Grade 1) in New Population of Interest
term df sumsq meansq statistic p.value
g1classtype 2 40819.13 20409.563 8.565375 0.0002274
Residuals 404 962650.57 2382.798 NA NA
# Tukey's HSD post-hoc test for reading scaled scores by class type (grade 1)

aov1_reading_tukey_results <- TukeyHSD(aov1_reading)

kable(tidy(aov1_reading_tukey_results), 
      caption = "Tukey's HSD Post-hoc Test for Reading Scaled Scores by Class Type (Grade 1) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Tukey’s HSD Post-hoc Test for Reading Scaled Scores by Class Type (Grade 1) in New Population of Interest
term contrast null.value estimate conf.low conf.high adj.p.value
g1classtype REGULAR CLASS-SMALL CLASS 0 -28.95451 -45.975017 -11.9340030 0.0002204
g1classtype REGULAR + AIDE CLASS-SMALL CLASS 0 -13.18388 -25.579625 -0.7881252 0.0340065
g1classtype REGULAR + AIDE CLASS-REGULAR CLASS 0 15.77063 -1.038617 32.5798866 0.0711872
aov1_reading_tukey_df <- as.data.frame(aov1_reading_tukey_results$g1classtype)
aov1_reading_tukey_df$comparison <- rownames(aov1_reading_tukey_df)

colnames(aov1_reading_tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")

ggplot(aov1_reading_tukey_df, aes(x = Comparison, y = Difference)) +
  geom_point(aes(color = p_value < 0.05), size = 3) +
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Tukey's HSD Post-hoc Test for Reading Scaled Scores by Class Type (Grade 1)",
       x = "Class Type Comparison",
       y = "Difference in Mean Reading Scaled Scores") +
  coord_flip() +
  theme_minimal()

We can observe that the results for reading scaled scores are consistent with the math scaled scores. Students in small classes have significantly higher reading scores compared to students in regular and regular-aide classes in this new population of interest.

9.2.3 Summary of Supporting Analysis 1

We see that for this immediate (very short term) effect of class size on academic performance, the results are consistent with the initial analysis.

9.3 Supporting Analysis 2: Impact of Class Size on Grades 4-8 Achievement

We will now investigate the impact of class size on student achievement in grades beyond STAR (grades 4-8). We will focus on reading and math scaled scores for grades 4-8 and analyze the effect of class type on these scores. Since we are now exploring effects after the STAR program, we will use cmpstype as the class type variable, which at this point represents the class type that the student spent at least 3-4 years in during K-3.

9.3.1 One-Way ANOVA for Math Scaled Scores by Class Type (Grades 4-8)

We will conduct a one-way ANOVA to test the null hypothesis that the mean math scaled scores are the same across different class types in grades 4-8. The alternative hypothesis is that at least one class type has a different mean math scaled score.

The model for each grade is as follows:

\[\begin{align*} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \end{align*}\]

where:

  • \(Y_{ij}\): Mean math scaled score for the \(j\)-th student in the \(i\)-th class type
  • \(\mu\): Overall mean math scaled score
  • \(\alpha_i\): Main effect of the \(i\)-th class type
  • \(\epsilon_{ij}\): Random error in the math scaled score for the student \(j\) in the \(i\)-th class type

The independence, normality, and homoscedasticity assumptions are required for the validity of the ANOVA results.

## Main effects plot of class type for math scores in grade 4
# 
# class_type_effect_plot_info <- subsetted_data %>%
#   group_by(cmpstype) %>%
#   summarise(mean_g4_math = mean(g4tmathss, na.rm = TRUE),
#             SE = sd(g4tmathss, na.rm = TRUE) / sqrt(n()),
#             Lower_CI = mean_g4_math - qt(0.975, n() - 1) * SE,
#             Upper_CI = mean_g4_math + qt(0.975, n() - 1) * SE)
# 
# class_type_effect_plot <- ggplot(class_type_effect_plot_info, aes(x = cmpstype, y = mean_g4_math)) +
#                           geom_point(size = 3, color = "skyblue") +
#                           geom_line(aes(group = 1), color = "skyblue") +
#                           geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
#                           labs(title = "Main Effect of Class Type on Math Scores (Grade 4)",
#                                x = "Class Type",
#                                y = "Mean Math Scaled Score") +
#                           theme_minimal()
# 
# class_type_effect_plot
# One-way ANOVA model for math scaled scores by class type (grades 4-8)

aov_g4_math <- aov(g4tmathss ~ cmpstype, data = subsetted_data)
aov_g5_math <- aov(g5tmathss ~ cmpstype, data = subsetted_data)
aov_g6_math <- aov(g6tmathss ~ cmpstype, data = subsetted_data)
aov_g7_math <- aov(g7tmathss ~ cmpstype, data = subsetted_data)
aov_g8_math <- aov(g8tmathss ~ cmpstype, data = subsetted_data)

# Kable for ANOVA results
aov_g4_math_summary <- broom::tidy(aov_g4_math)
aov_g5_math_summary <- broom::tidy(aov_g5_math)
aov_g6_math_summary <- broom::tidy(aov_g6_math)
aov_g7_math_summary <- broom::tidy(aov_g7_math)
aov_g8_math_summary <- broom::tidy(aov_g8_math)

kable(aov_g4_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 4) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Math Scaled Scores by Class Type (Grade 4) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 5151.582 2575.791 1.887543 0.1527476
Residuals 413 563590.627 1364.626 NA NA
kable(aov_g5_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 5) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Math Scaled Scores by Class Type (Grade 5) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 2126.525 1063.263 0.8839805 0.4139153
Residuals 413 496761.437 1202.812 NA NA
kable(aov_g6_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 6) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Math Scaled Scores by Class Type (Grade 6) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 2833.384 1416.692 1.010973 0.3647677
Residuals 411 575940.375 1401.315 NA NA
kable(aov_g7_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 7) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Math Scaled Scores by Class Type (Grade 7) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 6899.988 3449.994 2.269325 0.1046665
Residuals 414 629393.139 1520.273 NA NA
kable(aov_g8_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 8) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Math Scaled Scores by Class Type (Grade 8) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 5313.938 2656.969 1.630356 0.1971178
Residuals 413 673060.445 1629.686 NA NA
# Line plot of mean math scores by class type for grades 4-8

math_scores_by_grade <- subsetted_data %>%
  group_by(cmpstype) %>%
  summarise(mean_g4_math = mean(g4tmathss, na.rm = TRUE),
            mean_g5_math = mean(g5tmathss, na.rm = TRUE),
            mean_g6_math = mean(g6tmathss, na.rm = TRUE),
            mean_g7_math = mean(g7tmathss, na.rm = TRUE),
            mean_g8_math = mean(g8tmathss, na.rm = TRUE))

math_scores_by_grade_long <- math_scores_by_grade %>%
  pivot_longer(cols = starts_with("mean"), names_to = "Grade", values_to = "Mean_Math_Score")

math_scores_plot1 <- ggplot(math_scores_by_grade_long, aes(x = cmpstype, y = Mean_Math_Score, color = Grade)) +
                    geom_point(size = 3) +
                    geom_line(aes(group = Grade)) +
                    labs(title = "Mean Math Scaled Scores by Class Type (Grades 4-8)",
                         x = "Class Type",
                         y = "Mean Math Scaled Score") +
                    theme_minimal()

math_scores_by_grade <- subsetted_data %>%
  group_by(cmpstype) %>%
  summarise(mean_g4_math = mean(g4tmathss, na.rm = TRUE),
            mean_g5_math = mean(g5tmathss, na.rm = TRUE),
            mean_g6_math = mean(g6tmathss, na.rm = TRUE),
            mean_g7_math = mean(g7tmathss, na.rm = TRUE),
            mean_g8_math = mean(g8tmathss, na.rm = TRUE)) %>%
  pivot_longer(cols = starts_with("mean"), 
               names_to = "Grade", 
               values_to = "Mean_Math_Score") %>%
  mutate(Grade = factor(Grade,
                        levels = c("mean_g4_math", "mean_g5_math", "mean_g6_math", "mean_g7_math", "mean_g8_math"),
                        labels = c("4", "5", "6", "7", "8")))


math_scores_plot2 <- ggplot(math_scores_by_grade, aes(x = Grade, y = Mean_Math_Score, group = cmpstype, color = cmpstype)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  labs(x = "Grade",
       y = "Mean Math Scaled Score",
       color = "Class Type") +
  theme_minimal()

math_scores_plot1 + math_scores_plot2

No significant effect of K-3 class type on math scaled scores was observed in grades 4-8 in this new population of interest. In fact, based on the clear linear trend in mean math scores by class type across grades 4-8, fitting a linear regression model would provide a better generalization that math scores increase with grade level, and that K-3 class type does not have a significant effect on math scores in grades 4-8.

9.3.2 One-Way ANOVA for Reading Scaled Scores by Class Type (Grades 4-8)

We repeat the above analysis for reading scaled scores in grades 4-8 to investigate the impact of class type on reading achievement.

# One-way ANOVA model for read scaled scores by class type (grades 4-8)

aov_g4_read <- aov(g4treadss ~ cmpstype, data = subsetted_data)
aov_g5_read <- aov(g5treadss ~ cmpstype, data = subsetted_data)
aov_g6_read <- aov(g6treadss ~ cmpstype, data = subsetted_data)
aov_g7_read <- aov(g7treadss ~ cmpstype, data = subsetted_data)
aov_g8_read <- aov(g8treadss ~ cmpstype, data = subsetted_data)

# Kable for ANOVA results
aov_g4_read_summary <- broom::tidy(aov_g4_read)
aov_g5_read_summary <- broom::tidy(aov_g5_read)
aov_g6_read_summary <- broom::tidy(aov_g6_read)
aov_g7_read_summary <- broom::tidy(aov_g7_read)
aov_g8_read_summary <- broom::tidy(aov_g8_read)

kable(aov_g4_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 4) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Read Scaled Scores by Class Type (Grade 4) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 14208.83 7104.417 6.330984 0.0019634
Residuals 401 449988.72 1122.166 NA NA
kable(aov_g5_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 5) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Read Scaled Scores by Class Type (Grade 5) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 4932.341 2466.170 1.85976 0.1570085
Residuals 414 548992.710 1326.069 NA NA
kable(aov_g6_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 6) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Read Scaled Scores by Class Type (Grade 6) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 8735.97 4367.985 3.517853 0.0305551
Residuals 413 512806.51 1241.662 NA NA
kable(aov_g7_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 7) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Read Scaled Scores by Class Type (Grade 7) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 4076.161 2038.081 1.721074 0.1801512
Residuals 414 490255.019 1184.191 NA NA
kable(aov_g8_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 8) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for Read Scaled Scores by Class Type (Grade 8) in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 2068.782 1034.391 0.7731823 0.4622073
Residuals 414 553864.086 1337.836 NA NA
# Line plot of mean read scores by class type for grades 4-8

read_scores_by_grade <- subsetted_data %>%
  group_by(cmpstype) %>%
  summarise(mean_g4_read = mean(g4treadss, na.rm = TRUE),
            mean_g5_read = mean(g5treadss, na.rm = TRUE),
            mean_g6_read = mean(g6treadss, na.rm = TRUE),
            mean_g7_read = mean(g7treadss, na.rm = TRUE),
            mean_g8_read = mean(g8treadss, na.rm = TRUE))

read_scores_by_grade_long <- read_scores_by_grade %>%
  pivot_longer(cols = starts_with("mean"), names_to = "Grade", values_to = "Mean_Read_Score")

read_scores_plot1 <- ggplot(read_scores_by_grade_long, aes(x = cmpstype, y = Mean_Read_Score, color = Grade)) +
                    geom_point(size = 3) +
                    geom_line(aes(group = Grade)) +
                    labs(title = "Mean Read Scaled Scores by Class Type (Grades 4-8)",
                         x = "Class Type",
                         y = "Mean Read Scaled Score") +
                    theme_minimal()

read_scores_by_grade <- subsetted_data %>%
  group_by(cmpstype) %>%
  summarise(mean_g4_read = mean(g4treadss, na.rm = TRUE),
            mean_g5_read = mean(g5treadss, na.rm = TRUE),
            mean_g6_read = mean(g6treadss, na.rm = TRUE),
            mean_g7_read = mean(g7treadss, na.rm = TRUE),
            mean_g8_read = mean(g8treadss, na.rm = TRUE)) %>%
  pivot_longer(cols = starts_with("mean"), 
               names_to = "Grade", 
               values_to = "Mean_Read_Score") %>%
  mutate(Grade = factor(Grade,
                        levels = c("mean_g4_read", "mean_g5_read", "mean_g6_read", "mean_g7_read", "mean_g8_read"),
                        labels = c("4", "5", "6", "7", "8")))


read_scores_plot2 <- ggplot(read_scores_by_grade, aes(x = Grade, y = Mean_Read_Score, group = cmpstype, color = cmpstype)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  labs(x = "Grade",
       y = "Mean Read Scaled Score",
       color = "Class Type") +
  theme_minimal()

read_scores_plot1 + read_scores_plot2

Unlike math scaled scores, we actually observe significant effects of class type on reading scaled scores in grades 4 and 6. However, the effect is not consistent across all grades. We can run Tukey’s HSD post-hoc tests to identify which class types have significantly different reading scores in grades 4 and 6.

# Tukey's HSD post-hoc test for reading scaled scores by class type (grades 4 and 6)

aov_g4_read_tukey_results <- TukeyHSD(aov_g4_read)
aov_g6_read_tukey_results <- TukeyHSD(aov_g6_read)

kable(tidy(aov_g4_read_tukey_results), 
      caption = "Tukey's HSD Post-hoc Test for Read Scaled Scores by Class Type (Grade 4) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Tukey’s HSD Post-hoc Test for Read Scaled Scores by Class Type (Grade 4) in New Population of Interest
term contrast null.value estimate conf.low conf.high adj.p.value
cmpstype REGULAR-SMALL 0 -14.642573 -26.411027 -2.874118 0.0100981
cmpstype AIDE-SMALL 0 -10.838272 -19.372575 -2.303968 0.0083746
cmpstype AIDE-REGULAR 0 3.804301 -7.800143 15.408746 0.7209176
kable(tidy(aov_g6_read_tukey_results), 
      caption = "Tukey's HSD Post-hoc Test for Read Scaled Scores by Class Type (Grade 6) in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Tukey’s HSD Post-hoc Test for Read Scaled Scores by Class Type (Grade 6) in New Population of Interest
term contrast null.value estimate conf.low conf.high adj.p.value
cmpstype REGULAR-SMALL 0 -4.513128 -16.63037 7.604118 0.6557036
cmpstype AIDE-SMALL 0 -9.973412 -18.83193 -1.114891 0.0228173
cmpstype AIDE-REGULAR 0 -5.460284 -17.41967 6.499102 0.5308003
aov_g4_read_tukey_df <- as.data.frame(aov_g4_read_tukey_results$cmpstype)
aov_g4_read_tukey_df$comparison <- rownames(aov_g4_read_tukey_df)

colnames(aov_g4_read_tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")

g4_read_tukey <- ggplot(aov_g4_read_tukey_df, aes(x = Comparison, y = Difference)) +
  geom_point(aes(color = p_value < 0.05), size = 3) +
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Tukey's HSD: Reading Scaled Scores by Class Type (Grades 4 & 6)",
       x = "Class Type Comparison",
       y = "Mean Read Scaled Scores Diff. (Gr 4)") +
  # Hide legends
  guides(color = FALSE) +
  coord_flip() +
  theme_minimal()

aov_g6_read_tukey_df <- as.data.frame(aov_g6_read_tukey_results$cmpstype)
aov_g6_read_tukey_df$comparison <- rownames(aov_g6_read_tukey_df)

colnames(aov_g6_read_tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")

g6_read_tukey <- ggplot(aov_g6_read_tukey_df, aes(x = Comparison, y = Difference)) +
  geom_point(aes(color = p_value < 0.05), size = 3) +
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(y = "Mean Read Scaled Scores Diff. (Gr 6)",
       # No x-axis
       x = NULL) +
  coord_flip() +
  theme_minimal()

g4_read_tukey + g6_read_tukey

We can see that in grade 4, students in small classes have significantly higher reading scores compared to students in regular and regular-aide classes. In grade 6, students in small classes have significantly higher reading scores compared to students in regular-aide classes, but not compared to students in regular classes.

9.3.3 Summary of Supporting Analysis 2

We notice no significant effect of class type on math scaled scores in grades 4-8. However, we observe significant effects of class type on reading scaled scores in grades 4 and 6. While difficult to generalize across all grade levels, the results suggest that K-3 class type may have a lasting impact on reading achievement in certain grades.

9.4 Supporting Analysis 3: Impact of Class Size on High School GPA

We now explore further into the students’ academic career, focusing on high school GPA. We will investigate the impact of K-3 class type on high school GPA.

# histogram of high school GPA

gpa_hist <- ggplot(subsetted_data, aes(x = hsgpaoverall)) +
                    geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
                    labs(title = "Distribution of High School GPA",
                         x = "High School GPA",
                         y = "Frequency") +
                    theme_minimal()
# Side by side boxplots of high school GPA by class type (`cmpstype`)
gpa_boxplot <- ggplot(subsetted_data, aes(x = cmpstype, y = hsgpaoverall)) +
                        geom_boxplot(fill = "skyblue",
                                      color = "black") +
                        geom_jitter(width = 0.2, color = "navy", alpha = 0.5) + 
                        labs(title = "High School GPA by Class Type",
                             x = "Class Type",
                             y = "High School GPA") +
                        theme_minimal()

gpa_hist + gpa_boxplot

We also employ a one-way ANOVA to test the null hypothesis that the mean high school GPA is the same across different class types. The alternative hypothesis is that at least one class type has a different mean high school GPA.

The model is as follows:

\[\begin{align*} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \end{align*}\]

where:

  • \(Y_{ij}\): High school GPA for the \(j\)-th student in the \(i\)-th class type
  • \(\mu\): Overall mean high school GPA
  • \(\alpha_i\): Main effect of the \(i\)-th class type
  • \(\epsilon_{ij}\): Random error in the high school GPA for the student \(j\) in the \(i\)-th class type

The independence, normality, and homoscedasticity assumptions are required for the validity of the ANOVA results.

# One-way ANOVA model for high school GPA by class type

aov_gpa <- aov(hsgpaoverall ~ cmpstype, data = subsetted_data)

# Summary of ANOVA

aov_gpa_summary <- broom::tidy(aov_gpa)

kable(aov_gpa_summary, caption = "ANOVA Summary for High School GPA by Class Type in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for High School GPA by Class Type in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 43.27034 21.63517 0.5579172 0.5728504
Residuals 395 15317.49209 38.77846 NA NA

The one-way ANOVA results show that the class type does not have a significant effect on high school GPA in this new population of interest.

9.5 Supporting Analysis 4: Impact of Class Size on College Readiness Exam (ACT/SAT Scores)

We will create a new column college_readiness that keeps the ACT and SAT (converted to ACT scale) scores for students. If a student has taken both exams, we will use the maximum score. This combined score of college readiness exams will serve as a proxy for college readiness.

subsetted_data <- subsetted_data %>%
  mutate(college_readiness = pmax(hsactcomp, hsactconverted, na.rm = TRUE))

Similar to the previous analyses, we will investigate the impact of K-3 class type on college readiness scores using a one-way ANOVA.

# Histogram of college readiness scores

readiness_hist <- ggplot(subsetted_data, aes(x = college_readiness)) +
                          geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
                          labs(title = "Distribution of College Readiness Scores",
                               x = "College Readiness Score",
                               y = "Frequency") +
                          theme_minimal()


# Side by side boxplots of college readiness scores by class type (`cmpstype`)

readiness_boxplot <- ggplot(subsetted_data, aes(x = cmpstype, y = college_readiness)) +
                              geom_boxplot(fill = "skyblue",
                                            color = "black") +
                              geom_jitter(width = 0.2, color = "navy", alpha = 0.5) +  
                              labs(title = "College Readiness Scores by Class Type",
                                   x = "Class Type",
                                   y = "College Readiness Score") +
                              theme_minimal()

readiness_hist + readiness_boxplot

The one-way ANOVA model for college readiness scores by class type is as follows:

\[\begin{align*} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \end{align*}\]

where:

  • \(Y_{ij}\): College readiness score for the \(j\)-th student in the \(i\)-th class type
  • \(\mu\): Overall mean college readiness score
  • \(\alpha_i\): Main effect of the \(i\)-th class type
  • \(\epsilon_{ij}\): Random error in the college readiness score for the student \(j\) in the \(i\)-th class type

The independence, normality, and homoscedasticity assumptions are required for the validity of the ANOVA results.

# One-way ANOVA model for college readiness scores by class type

aov_readiness <- aov(college_readiness ~ cmpstype, data = subsetted_data)

# Summary of ANOVA

aov_readiness_summary <- broom::tidy(aov_readiness)

kable(aov_readiness_summary, caption = "ANOVA Summary for College Readiness Scores by Class Type in New Population of Interest") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
ANOVA Summary for College Readiness Scores by Class Type in New Population of Interest
term df sumsq meansq statistic p.value
cmpstype 2 35.1403 17.57015 0.9244183 0.3975795
Residuals 414 7868.7782 19.00671 NA NA

Again, the one-way ANOVA results show that the K-3 class type does not have a significant effect on college readiness scores in this new population of interest.

10 Summary of Supporting Analyses

  1. Immediate (Very Short Term) Effect of Class Size on Academic Performance: The results are consistent with the initial analysis, showing that students in small classes have significantly higher math scores compared to students in regular and regular-aide classes.

  2. Impact of Class Size on Grades 4-8 Achievement: No significant effect of K-3 class type on math scaled scores in grades 4-8 was observed. However, significant effects of class type on reading scaled scores were observed in grades 4 and 6. Due to the inconsistency across grades, generalizing the effect of K-3 class type on reading achievement is challenging.

  3. Impact of Class Size on High School Achievement: The class type does not have a significant effect on high school GPA or college readiness scores in this new population of interest.

10.1 Answering the Extended Research Question and Discussion

The extended research question was:

For students who complete both primary and secondary education with the objective of pursuing higher education (college), does the exposure to small class sizes in early education (K-3) have a significant impact on their high school academic performance and college readiness?

Based on the supporting analyses conducted, we can conclude that the effects of class size on student achievement do not persist beyond the early grades (K-3) to significantly influence high school GPA or college readiness in this new population of interest. While there were some significant effects on reading scores in grades 4 and 6, the overall impact on high school GPA and college readiness was not significant.

The results suggest that while small class sizes may have immediate short-term benefits on academic performance, these effects do not appear to have a lasting impact on high school academic achievement or college readiness. Other factors, such as individual student characteristics, school resources, and teaching quality, may play a more significant role in determining high school GPA and college readiness. These are interesting areas that may be more fit for an observational study to explore the complex interplay of various factors on long-term academic outcomes.

11 Notes on Causal Effects

Ideally, we’d like to draw causal conclusions about the impact of K-3 class type on student achievement. For causal inference to be valid, the following assumptions must be satisfied:

  1. Random Assignment (Ignorability):

    • Project STAR’s randomized controlled trial design ensures that students and teachers were randomly assigned to class types (small, regular, regular with aide). Randomization ideally removes biases stemming from selection effects, making the treatment (class size) independent of potential confounders.
    • However, operational adjustments after kindergarten (redistribution between regular and regular-aide classes) might introduce minor complications. Our analysis mitigates this by focusing on students who experienced consistent class types for at least three years.
  2. Stable Unit Treatment Value Assumption (SUTVA):

    • SUTVA requires that the outcome for one student is unaffected by the class assignment of other students. In the context of Project STAR, this assumption might be violated if there are peer effects across classrooms or schools. For example, teachers sharing resources or strategies between class types could potentially blur treatment distinctions.
    • Given the controlled randomization within each school, and the relatively small scale of class sizes in the small-class condition, peer effects across class types likely have limited impact, thus reasonably satisfying SUTVA.
  3. Consistency:

    • Consistency assumes that the treatment definition is stable and consistent across participants. Project STAR clearly defines treatments (class sizes and presence or absence of aides). Nevertheless, minor variations in classroom management and instructional quality could exist. The extent of these variations was minimized through standardized procedures, enhancing the consistency assumption.
  4. Positivity:

    • Positivity requires a non-zero probability of assignment to each treatment for all individuals in the population. This was ensured by Project STAR’s randomized design, with all eligible students equally likely to be assigned to any class type at enrollment.

11.0.1 Evaluating the Validity of Causal Conclusions

Given the robustness of Project STAR’s experimental design, we can reasonably interpret observed effects, particularly immediate short-term impacts, as causal. The significant short-term effect of smaller class sizes on student math achievement is credible.

However, for the longer-term effects (grades 4–8 and beyond), causal interpretations become less straightforward due to:

  • Attrition and Missing Data: Students who dropped out or did not complete all assessments could bias longitudinal results. While our analysis restricts the population to students with complete longitudinal data, this selectivity introduces potential biases.
  • Diminishing Treatment Effect: The lack of significant sustained effects beyond early grades may indicate a dilution of the treatment effect over time or increasing influence from other environmental and educational factors.

11.0.2 Conclusion on Causality

In summary, while the assumptions for causal inference are largely satisfied, particularly in the short-term analysis due to the randomized controlled trial design, cautious interpretation is advised for long-term causal claims. Immediate effects of small class sizes can be confidently viewed as causal, but long-term academic outcomes should be interpreted with consideration of potential attrition biases and external educational factors.

12 Model Diagnostics

Click to expand model diagnostics

The following section plots all the diagnostic plots for all the ANOVA models conducted in the supporting analyses. These plots help us assess the validity of the ANOVA results by checking the assumptions of independence, normality, and homoscedasticity.

# Model Diagnostics

# Set graphical parameters
par(mfrow = c(3, 2), mar = c(4, 4, 2, 1))

# List of ANOVA models
anova_models <- list(
  "Grade 1 Math" = aov1,
  "Grade 4 Math" = aov_g4_math,
  "Grade 5 Math" = aov_g5_math,
  "Grade 6 Math" = aov_g6_math,
  "Grade 7 Math" = aov_g7_math,
  "Grade 8 Math" = aov_g8_math,
  "Grade 4 Reading" = aov_g4_read,
  "Grade 5 Reading" = aov_g5_read,
  "Grade 6 Reading" = aov_g6_read,
  "Grade 7 Reading" = aov_g7_read,
  "Grade 8 Reading" = aov_g8_read,
  "High School GPA" = aov_gpa,
  "College Readiness" = aov_readiness
)

# Loop through models and generate diagnostics
for (model_name in names(anova_models)) {
  model <- anova_models[[model_name]]

  # Residuals vs. Fitted
  plot(model$fitted.values, resid(model),
       main = paste(model_name, "- Residuals vs Fitted"),
       xlab = "Fitted Values", ylab = "Residuals", pch = 19,
       col = rgb(0, 0, 1, alpha = 0.5))
  abline(h = 0, col = "red", lwd = 1.5, lty = 2)

  # QQ Plot
  qqnorm(resid(model), main = paste(model_name, "- Normal Q-Q"), pch = 19,
         col = rgb(0, 0.5, 0, alpha = 0.5))
  qqline(resid(model), col = "red", lwd = 1.5, lty = 2)
}

# Reset graphical parameters
par(mfrow = c(1, 1))

We note that based on these plots, not all variables satisfy the assumptions of normality and homoscedasticity. This further emphasizes the need for caution in interpreting the ANOVA results, particularly for long-term effects.

13 Acknowledgement

I acknowledge and thank my classmates Hangyu Li and Shang Chen for their helpful discussions and sharing unique approaches to the analysis.

14 References

  1. Achilles, C. M., Bain, H. P., Bellott, F., Boyd-Zaharias, J., Finn, J., Folger, J., Johnston, J., & Word, E. (2008). Tennessee’s Student Teacher Achievement Ratio (STAR) project [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/SIWH9F

  2. C.M. Achilles; Helen Pate Bain; Fred Bellott; Jayne Boyd-Zaharias; Jeremy Finn; John Folger; John Johnston; Elizabeth Word, 2008, “Tennessee’s Student Teacher Achievement Ratio (STAR) project”, https://doi.org/10.7910/DVN/SIWH9F, Harvard Dataverse, V1; Project STAR K-3

  3. C.M. Achilles; Helen Pate Bain; Fred Bellott; Jayne Boyd-Zaharias; Jeremy Finn; John Folger; John Johnston; Elizabeth Word, 2008, “Tennessee’s Student Teacher Achievement Ratio (STAR) project”, https://doi.org/10.7910/DVN/SIWH9F, Harvard Dataverse, V1; STAR follow-up Studies (1996-1997) Report

  4. C.M. Achilles; Helen Pate Bain; Fred Bellott; Jayne Boyd-Zaharias; Jeremy Finn; John Folger; John Johnston; Elizabeth Word, 2008, “Tennessee’s Student Teacher Achievement Ratio (STAR) project”, https://doi.org/10.7910/DVN/SIWH9F, Harvard Dataverse, V1; starUsersGuide.pdf

  5. Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if (1st ed.). Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

15 Session Information

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotly_4.10.4     ggalluvial_0.12.5 tidyr_1.3.1       forcats_1.0.0    
##  [5] foreign_0.8-86    AER_1.2-14        survival_3.6-4    sandwich_3.1-1   
##  [9] lmtest_0.9-40     zoo_1.8-12        broom_1.0.7       MASS_7.3-60.2    
## [13] car_3.1-3         carData_3.0-5     patchwork_1.3.0   kableExtra_1.4.0 
## [17] knitr_1.48        ggthemes_5.1.0    ggplot2_3.5.1     dplyr_1.1.4      
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5      xfun_0.51         bslib_0.8.0       htmlwidgets_1.6.4
##  [5] lattice_0.22-6    crosstalk_1.2.1   vctrs_0.6.5       tools_4.4.1      
##  [9] generics_0.1.3    tibble_3.2.1      fansi_1.0.6       highr_0.11       
## [13] pkgconfig_2.0.3   Matrix_1.7-0      data.table_1.16.4 lifecycle_1.0.4  
## [17] compiler_4.4.1    farver_2.1.2      stringr_1.5.1     munsell_0.5.1    
## [21] htmltools_0.5.8.1 sass_0.4.9        lazyeval_0.2.2    yaml_2.3.10      
## [25] Formula_1.2-5     pillar_1.9.0      jquerylib_0.1.4   cachem_1.1.0     
## [29] abind_1.4-8       tidyselect_1.2.1  digest_0.6.37     stringi_1.8.4    
## [33] purrr_1.0.2       labeling_0.4.3    splines_4.4.1     fastmap_1.2.0    
## [37] grid_4.4.1        colorspace_2.1-1  cli_3.6.3         magrittr_2.0.3   
## [41] utf8_1.2.4        withr_3.0.1       scales_1.3.0      backports_1.5.0  
## [45] httr_1.4.7        rmarkdown_2.28    evaluate_1.0.0    viridisLite_0.4.2
## [49] rlang_1.1.4       glue_1.8.0        xml2_1.3.6        svglite_2.1.3    
## [53] rstudioapi_0.17.1 jsonlite_1.8.9    R6_2.5.1          systemfonts_1.2.1